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The recent advent of next-generation sequencing technologies has dramatically changed the nature of biomedical research. 
Human genetics is no exception - it has never been easier to interrogate human patient genomes at the nucleotide level to 
identify disease-associated variants. To further facilitate the efficiency of this approach, whole exome sequencing (WES) was 
first developed in 2009. Over the past three years, multiple groups have demonstrated the power of WES through robust 
disease-associated variant discoveries across a diverse spectrum of human diseases. Here, we review the application of WES 
to different types of inherited human diseases and discuss analytical challenges and possible solutions, with the aim of 
providing a practical guide for the effective use of this technology. 
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Introduction 

Whole exome sequencing (WES) is a technique to 
selectively capture and sequence the coding regions of all 
annotated protein-coding genes. Coupled with next-gene- 
ration sequencing (NGS) platforms, it enables the analysis of 
functional regions of the human genome with unpre- 
cedented efficiency. Since its first reported application [1,2], 
WES has emerged as a powerful and popular tool for 
researchers elucidating genetic variants underlying human 
diseases (Fig. 1) despite certain limitations (see below). 

Overview of WES Pipeline 

WES pipelines generally follow a similar process, regard- 
less of the capture method and NGS platform used, as 
summarized in Fig. 2. The experimental pipeline can be 
divided into two parts: 1) preparing genomic DNA libraries 
and hybridizing them to capture arrays and 2) NGS of the 
eluted target fragments. There are a number of commercially 
available capture arrays, and their strengths and weaknesses 
have been well described elsewhere [3]. Once short se- 
quencing reads have been generated, they are mapped to the 
reference human genome, and variant calling is carried out. 



Subsequent annotation of these variants is necessary to 
further evaluate their potential biological effect; publicly 
available software and databases can be used for this 
purpose. 

Strengths and Weaknesses of WES 

WES is a robust technology that is extremely practical for 
investigating coding variation at the genome-wide level. 
Despite the plummeting cost of sequencing in the past few 
years, WES at mean coverage depth of 100 x still costs five 
times less than whole genome sequencing (WGS) at mean 
coverage depth 30 x. In addition, the size of WES data per 
patient is approximately a sixth of WGS data, resulting in 
reduced processing time and imposing less of a burden in 
terms of data storage. 

However, it is important to understand the limitations of 
WES technology. Certain protein-coding regions might not 
be covered due to incomplete annotation of the human 
genome. Further, WES does not cover potentially functional 
non-coding elements, including untranslated regions, en- 
hancers, and long-noncoding RNAs, although these are, in 
themselves, not clearly defined. Another drawback to WES is 
the limited ability to detect structural variations, such as 
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copy-number variations, translocations, and inversions. 

In spite of these limitations, WES is still the tool of choice 
for many researchers— its practical advantages allow large 
numbers of patients to be screened in a robust fashion, a 
crucial aspect of mutation discovery in human genetics 
research. 



Application of WES to Various Human Dise- 
ase Types 

WES has been used in identifying genetic variants asso- 
ciated with a variety of diseases. Here, we discuss different 
categories of inherited human disease and the methodology 
employed to examine each case. 
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Fig. 1. Increasing number of whole exome sequencing papers over 
time. 



Mendelian Diseases, Recessive 

Diseases inherited in a recessive pattern have traditionally 
been highly amenable to genetic analysis, due to the fact that 
homozygous variants are easily detectable. Previously, if a 
large family with many affected members was available for 
pedigree analysis, one would perform linkage analysis by 
genotyping family members in order to identify relatively 
short genomic intervals that presumably contained the 
disease-associated variant. Direct interrogation of entire 
genomes for homozygous variants is now possible with NGS 
technologies, and public datasets can be used to exclude 
common variants that are less probable to be disease- 
causing. Identifying recessive variants is especially straight- 
forward if the proband is a product of a consanguineous 
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Fig. 2. Overview of whole exome sequen- 
cing pipeline. SNV, single nucleotide 
variant. 
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marriage — sequencing the entire exome of the patient would 
likely yield a manageable number of rare homozygous 
variants, and further studies could be performed to inve- 
stigate the functional relevance of these variants. 

In one of the earliest reports of the application of WES 
technology, Bilgiivar et al. [4] identified recessive mutations 
in WDR62 in multiple Turkish consanguineous patients with 
severe developmental brain defects. Very little was known 
about WDR62 at that time, but subsequent genetic and 
functional studies have since validated the importance of 
this gene in proper brain development. More recently, a large 
cohort of patients of similar genetic status and clinical 
manifestation were analyzed using homozygosity mapping 
from WES data [5]. The authors uncovered 22 genes not 
previously identified as disease-causing, further demon- 
strating the power of WES as an ideal tool for gene discovery. 

Mendelian Diseases, Dominant 

Mendelian diseases with dominant modes of inheritance 
pose greater technical challenges for genetic analysis. Hete- 
rozygous variants are generally more difficult to detect and 
analyze, not only due to their sheer number as compared to 
homozygous variants but also because they are subjected to 
higher false positive and false negative errors. Nevertheless, 
one could still design one of the following experiments to 
elucidate the genetic architecture underlying these diseases, 
depending on the availability of patient cohorts and the effect 
of the disease on reproductive fitness. 

Familial cohort 

If large families with the disease of interest are available, 
WES should be performed on at least two affected family 
members. In analyzing the results, any shared variants could 
potentially be disease-causing; using linkage data to focus on 
the variants located in linked intervals would greatly reduce 
the number of variants to be considered. A recent example of 
such an approach is a WES study on familial amyotrophic 
lateral sclerosis (FALS) [6] . The authors recruited two large 
families with FALS and selected two affected members with 
the greatest genetic distance from each family for WES. 
Among the rare functional variants that were identified, they 
tested the shared variants for Mendelian segregation among 
affected family members, eventually narrowing their search 
down to a single common gene, PFN1 . 

Non-familial cohort 

When large families with the disease are not readily avai- 
lable, one could perform WES on a number of unrelated 
patients with similar clinical manifestations and select genes 
that are commonly mutated in the patient cohort. In carrying 



out such an experiment, it is important to consider the 
background mutation burden of the entire human gene set; 
it is possible that the longest genes might contain the highest 
number of rare functional variants and therefore appear to be 
the most interesting. To circumvent such errors in analysis, 
the variant burden of each gene should be normalized using 
data from a control population. In a WES study of pseu- 
dohypoaldosteronism type II (PHAII), the variant burden of 
every gene from 1 1 unrelated patients was compared against 
that of 699 controls [7] . A single gene that was specifically 
enriched for mutations in the patient set, KLHL3, was 
identified (5 variants from 1 1 patients compared to 2 from 
699 controls, with p- value of 1.1 x 10" 8 ). 

Non-inheritable diseases 

If the disease phenotype is severe enough such that it 
affects reproductive fitness, one could hypothesize that de 
novo variants - variants that emerged during meiosis of the 
germ cells that are unique to the offspring— are directly 
associated with the disease. These can be detected by 
performing trio-based WES on the unaffected parents and 
the proband, and variants that are present in the offspring 
only and not in the parents will be called de novo variants. The 
rate of de novo mutations has been shown to be relatively 
stable and affected primarily by paternal age [8]. WES of 
family quartets can also be carried out, where unaffected 
siblings are recruited simultaneously and their de novo 
mutations are compared against those of the proband. A 
series of studies involving autism trios and quartets recently 
demonstrated that rare de novo mutations were associated 
with the risk of autism and identified multiple de novo 
mutations in SCN2A, KATNAL2, and CHD8, implicating 
these genes in the genetic etiology underlying autism [9-11]. 

Common Complex Diseases 

The majority of the disease burden that modern societies 
endure can be attributed to common complex diseases. 
These diseases typically have both genetic and environ- 
mental causes, and they also possess significant genetic 
heterogeneity, making the identification of disease-causing 
genes very challenging. Traditionally, the use of linkage 
analysis of families with extreme disease phenotypes has 
been somewhat successful in mutation discovery [12]. 
However, these mutations are particular to patients with 
extreme phenotypes and do not explain the pathophysiology 
of disease for patients in the general population. Large 
genome-wide association studies of common variants have 
also been employed with limited success; the findings in 
many studies are not robust and only explain a small fraction 
of the disease risk [13]. As a result, greater attention has 
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been drawn to profiling rare variants with allele frequencies 
of less than 1% in complex diseases. 

Taking advantage of WES technology, several large-scale 
projects have been launched. One such example is a project 
led by National Heart, Lung, and Blood Institute (NHLBI) to 
discover novel genes underlying cardiovascular disorders. As 
a result of sequencing the exomes of 2,440 individuals, the 
study reported an excess of rare functional variants and 
concluded that large numbers of subjects will be required to 
attain sufficient power to discover variants that are 
significantly associated with the disease traits [14]. 

Challenges in Analysis 

WES yields extensive lists of genomic variation, and there 
are many important caveats to bear in mind when carrying 
out analysis of such data. In this section, we address the 
potential challenges that the investigators need to consider 
when designing or performing the experiments. 

False Positive and False Negative Calls 

Recent improvements of the WES technique in both 
experimental and bioinformatics pipelines have reduced the 
occurrence of false positive (false variant that is called true) 
and false negative (true variant that is failed to be called) 
variant calls substantially. Currently, we can achieve —98% 
sensitivity and 99.8% specificity from exome sequencing 
data, minimizing the chance of missing any true variants as 
compared to the analysis of SNP array data (data not shown) . 
However, as mentioned previously, it is technically chal- 
lenging to detect and evaluate rare heterozygous variants 
because heterozygous variant calling is more susceptible to 
the technical errors and require higher read depth. For 
example, at a coverage depth of 4x (i.e., if a base is covered 
by 4 independent reads) and assuming a 1% per-base error 
rate, a homozygous variant will be called if 0 and 4 reads are 
observed for the reference and nonreference alleles res- 
pectively, with a false positive rate of 2 x lfj 4 . However, if 
both reference and nonreference alleles have 2 x coverage, a 
heterozygous variant will be called with a false positive rate 
of 0.34. This presents a particular problem when one con- 
siders the uneven coverage in WES resulting from dif- 
ferential capture efficiencies across the exome. To achieve 
reliable heterozygous calling, it is therefore necessary to 
achieve sufficient coverage depth across the entire exome so 
as to minimize the number of bases with low coverage. 

Another challenge to analysis is read misalignments; the 
typical length of NGS reads is near or less than 100 bp, and 
even paired reads are subject to being improperly aligned, 
because there are large portions of the human genome where 



the DNA sequence is highly repetitive and duplicated. In 
fact, segmental duplicated regions, defined as intervals 
larger than 1 kb having a homology >90% with other parts of 
the genome, encompass about 5% of the human genome 
[15]. Hence, imposing more rigorous filtering criteria to 
remove such reads, including Phred score and mapping 
quality score cutoffs, is essential. 

Population Stratification 

The recent, dramatic acceleration in human population 
growth over the last several millennia has resulted in an 
excess of rare variants in the human genome, and these 
variants have not had enough time to be subjected to natural 
selection [14] . Since the identification of disease-associated 
variants consists largely of assessing rare variants, one must 
bear in mind the possibility that a variant of interest could be 
population-specific and not necessarily disease-causing. 
This is especially problematic if the patient cohort is not of 
European or, more specifically, northwestern European des- 
cent. Recent public sequencing projects, such as the 1000 
Genomes Project and NHLBI Go Exome Sequencing Project, 
have covered a greater variety of non-European ethnic 
groups than before, and they report that these groups harbor 
a greater burden of rare variants as compared to Europeans 
[14, 16]. For example, indigenous African individuals may 
carry 2-3 times more rare variants, and East Asian indi- 
viduals may carry 1.5 times more variants when compared to 
individuals of European descent. Rare variants specific to the 
population of study might be easily mistaken as patient- 
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Fig. 3. Increased sensitivity to detect causal gene as patient cohort 
size increases. Percentages shown denote ratio of patients that carry 
rare functional variants. Simulation was performed assuming that 
0.1% of healthy controls carry rare functional variants, and was 
iterated 10 5 times. Each iteration was evaluated and called 
'detected' if the p-value exceeded the genome-wide significance. 
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enriched variants; it is thus necessary to screen the pre- 
sumed variant with carefully matched healthy individuals. 

Locus Heterogeneity 

Locus heterogeneity describes the phenomenon where a 
single disease can be caused by multiple different loci across 
different patients. Over the last few decades, studies per- 
forming linkage analysis followed by positional cloning have 
identified many genes as being responsible for various 
inherited diseases; the Online Mendelian Inheritance in Man 
(OMIM) database currently records 3,000 genes as being 
disease-causing (http://www.ncbi.nlm.nih.gov/omim) . 
This also suggests that the majority of the remaining dise- 
ases without known associations with genes have not been 
amenable to classical approaches and will likely display 
extensive genetic heterogeneity. Increasing the size of the 
patient cohort will increase the power to discover variants, as 
shown in Fig. 3. Pathway or gene ontology analyses can also 
be performed to link the mutated genes into certain 
functional categories— there are several publicly available 
tools for this purpose. A good example of this method is the 
aforementioned study on PHAII— the authors initially 
identified KLHL3 from WES of 1 1 patients and subsequently 
considered KLHL3's presumed functional partner, CUL3, as a 
potential gene candidate. Extending the screening to 52 
patients, KLHL3 was found to be mutated in 24 patients 
(46.2%) and CUL3 in another 17 patients (32.7%), explai- 
ning a total of 78.8% of the entire patient set. Another 
possible approach is to re-evaluate the clinical diagnosis of 
patients in the cohort by correlating subtle differences of 
clinical measurements and genetic variants in order to focus 
the analysis on a more clinically homogeneous set of patients 
as a separate cohort. 

Conclusion 

There is no doubt that NGS has provided researchers with 
unprecedented power to resolve the genetic etiology of 
various human diseases. When WES was first introduced, its 
utility was highly debated due to its apparent limitations, 
such as incomplete coverage of functional elements and low 
sensitivity for structural variant detection. However, the 
practical advantages of the technology have made it a favored 
tool for researchers, and WES will likely continue to be 
widely used for the foreseeable future. Furthermore, as 
library capture methods and data analysis pipelines improve, 
increasing amounts of genomic information, aside from 
single-nucleotide changes and short indels, can be extracted. 
Examples of this include homozygosity interval mapping, 
common SNP genotyping information for various popula- 



tion-level analyses, and detection of structural variants [17, 
18]. Finally, the strengths of WES— short turnaround times, 
low cost, and relatively easy data interpretation —make it an 
optimal tool for clinical diagnosis. The increased use of WES 
in the clinic will surely spur the development of personalized 
medicine and reinvent treatment practices in the near future 
[1, 19]. 
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